Parallel object detection method for heterogeneous multithreaded microarchitectures

ABSTRACT

A parallel object detection method for heterogeneous microarchitectures. The method is designed to increase the throughput of object detection in a computer system equipped with an array of cores, each comprising a shared memory, a constant memory, and functional units. Latency reduction is achieved through a multilevel parallelization method that exploits fine-grain data-level parallelism using multithreaded SIMD computations, and coarse-grain parallelism by relying on concurrent kernel execution.

FIELD OF THE INVENTION

The present invention pertains to the field of parallel processing. In particular, it relates to parallel object detection methods in images for heterogeneous multithreaded microarchitectures such as those that combine graphics processing units with central processing units.

BACKGROUND OF THE INVENTION

Modern computer systems usually include, without limitation, a central processing unit (CPU), a graphics processing unit (GPU) and several input or output devices. Typically, the CPU is designed for executing general purpose software whereas the GPU is specifically optimized for performing 3D rendering computations such as texture mapping or geometric transformations. State of the art CPU microarchitectures include one or more processing cores with the purpose of exploiting both coarse-grain and fine-grain thread-level parallelism (TLP). Coarse-grain parallelism is achieved by concurrently executing several computing tasks or processes in the available general purpose CPU cores. This load balancing distribution is implemented in the operating system (OS) scheduler by assigning to each process an execution time slot for each core. On the other hand, fine-grain TLP is implemented in the hardware of each core and tries to minimize the underutilization of the functional units by simultaneously fetching and executing instructions from different threads. This technique is known as simultaneous multithreading and is extensively used in current CPU designs for maximizing the performance of parallel applications and OS processes.

Examples of workloads that potentially benefit from parallelism are computer vision algorithms, particularly object detection methods. These techniques determine the location of specific objects such as traffic signs, handwritten characters or even human faces within an image or video frame. One of the most widely used methods for performing object detection relies on a boosted cascade of classifiers. This cascade arranges a set of weak classifiers sequentially with the purpose of building an aggregated strong classifier. This approach facilitates the rejection of negative candidates at early stages of the cascade, thus quickly discarding image regions that are not likely to contain the desired objects. Even though the hierarchical nature of the boosted cascade prevents any attempt of inter-stage parallelization, it is still possible to evaluate different image regions in parallel just by assigning them to different CPU threads.

Since these threads need to perform a huge amount of arithmetic operations, the overall object detection latency could be dramatically reduced if the amount of arithmetic and logic units (ALUs) within each CPU core were increased. Unfortunately, the flat memory access model and complex out-of-order execution engines offered by CPUs tend to spend large portions of the chip die on big caches, buffers and speculation logic, thus reducing the available area for additional functional units. Unlike general-purpose CPUs, stream processor microarchitectures such as GPUs try to exploit data-level parallelism (DLP) and adopt a radically different memory hierarchy with small-sized on-die shared memories in which data locality is managed by the programmer. The footprint in terms of spent die area for this approach is much smaller and therefore more transistors are devoted to increasing the number of available ALUs. In order to maximize the utilization of these ALUs, modern GPUs implement DLP through Single Instruction Multiple Data (SIMD) instructions that are executed in an array of lightweight multithreaded cores. These cores are organized in clusters in such a manner that data locality and synchronization within the cluster is achieved by using a shared memory.

Heterogeneous microarchitectures combine the characteristics of both CPUs and GPUs, usually in the same chip die. These designs offer a massively parallel multithreaded execution engine that is tightly coupled with one or more general purpose out-of-order processing cores. With the emergence of such technology, there is a need for a parallel object detection method that fully exploits the computing capabilities of the underlying hardware. The efficient usage of both coarse-grain and fine-grain parallelism within all the steps involved during the object detection process would maximize the occupancy of the available SIMD processing units, thus decreasing the latency of image analysis. This increased detection throughput enables the real-time processing of high resolution images and video frames that feature a large amount of objects (e.g. human faces) in scenarios such as highly crowded environments.

SUMMARY OF THE INVENTION

Embodiments of the present invention provide a computer-implemented method for efficiently performing detections of variable sized objects in images under heterogeneous multithreaded architectures. The throughput of such systems is heavily dependent on the proper balance of thread computations among the plurality of cores and SIMD processing units available. Therefore, there exists a need for an object detection method that fully exploits the underlying architecture of those massively parallel computer systems, in order to accelerate this time-consuming process.

One embodiment of the present invention sets forth a method for performing object detection by exploiting multiple levels of parallelism. The method includes the steps of generating a plurality of downscaled images from the input image, and executing a plurality of data-parallel kernel operations to determine whether an image region contains an object or not.

According to one aspect of the present invention, the said kernel operations involve the computation of summed area tables, through their decomposition into a plurality of data-parallel prefix sum and parallel transposition operations, and the parallel evaluation of a cascade of classifiers.

According to another aspect of the present invention, all the abovementioned kernel operations are concurrently scheduled and executed in the computer system. Therefore, fine-grain parallelism is exploited within a kernel by performing computations using a plurality of threads, which themselves rely on a plurality of SIMD instructions for exploiting data-level parallelism in the functional units. Moreover, coarse-grain parallelism is exploited by concurrently executing the said kernel operations in the plurality of cores.

According to another aspect of the present invention, the summed area table is split into a plurality of table chunks, which are then stored in the shared memory of each core in order to improve data locality, and thus increase the memory bandwidth during the cascade evaluation process. Similarly, the cascade classifiers are stored in the constant memory available in each core before starting the required computations.

One advantage of the disclosed method is that it achieves the full occupancy of both the cores and the functional units of the abovementioned computer system, while reducing the latency of memory operations through an efficient usage of the underlying cache hierarchy. Thus, the overall speed and throughput of the object detection process is dramatically improved relative to prior art techniques.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates the block diagram of a computer system that represents a heterogeneous microarchitecture according to an embodiment of the present invention;

FIG. 2 is a block diagram of the parallel processing unit depicted in the computer system of FIG. 1;

FIG. 3 shows a block diagram with further details of the parallel processing unit according to an embodiment of the present invention;

FIG. 4 illustrates the computation phase of the summed area table for a given image, in accordance with one or more aspects of the present invention;

FIG. 5 illustrates the steps involved in the parallel computation of object detection in the computer system depicted in FIG. 1, according to one embodiment of the present invention;

FIG. 6 shows the computation steps required for the computation of the parallel exclusive prefix sum of steps 703 and 706 depicted in FIG. 7, according to one embodiment of the present invention;

FIG. 7 illustrates the computation steps required for the parallel computation of the summed area table, according to one embodiment of the present invention;

FIG. 8 depicts the system memory chunks transferred to the shared memory of a given core of the parallel processing unit, according to one embodiment of the present invention;

FIG. 9 shows the programming model used for decomposing parallel computations into thread blocks that target the parallel processing unit, according to one embodiment of the present invention;

FIG. 10 depicts the flowchart with all the steps required for the parallel evaluation of the boosted cascade of classifiers, according to one embodiment of the present invention;

FIG. 11 illustrates the differences between serial and concurrent kernel execution in the cores of the parallel processing unit, in accordance with one or more aspects of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

Embodiments of the present invention provide a parallel object detection method for efficiently exploiting the hardware resources of heterogeneous multithreaded microarchitectures such as those that combine CPUs with GPUs. Therefore, the problem to be solved involves the decomposition of the elemental operations performed during the object detection process into several parallel computations using threads in such a way that maximizes the overall throughput.

System Overview

An embodiment of a heterogeneous multithreaded microarchitecture is shown as a computer system block diagram in FIG. 1. The computer system includes N central processing units or CPU cores 101 and a bus path communicating to the external system memory 105 via a memory controller 104 and a multiprocessor interconnect network 103. The memory controller 104 may implement a DRAM interface (e.g. DDR4) which is shared with the parallel processing unit (PPU) 102. Each one of the available CPU cores 101 is designed for executing general purpose code and may implement a combination of any of the existing techniques for exploiting both instruction and thread level parallelism (e.g., out of order execution and simultaneous multithreading). The PPU 102 is a massively parallel engine that is specifically conceived for performing computations of workloads that benefit from data level parallelism (DLP). These workloads may range from graphics computations such as geometric transformations (usually performed in shaders) to fluid dynamic simulations or computational finance. The PPU 102 could be implemented in hardware as a programmable accelerator that features stream computing capabilities or as a graphics processing unit (GPU).

The architectural block diagram shown in FIG. 1 is illustrative and may be modified as desired. For instance, in some embodiments, the CPU cores 101 and the PPU 102 could use decoupled memories instead of accessing a unified system memory. In this case, the PPU 102 includes its own memory controller 104 and is thus directly connected to its corresponding system memory 105. Referring again to FIG. 1, the depicted block diagram is meant to be implemented in the same chip. Therefore, it could be integrated in a package on package (PoP) system on chip (SoC) across multiple layers or even in a single chip die. However, in particular embodiments, variations may exist. For example, a given heterogeneous microarchitecture implemented on a chip might not include the system memory 105, instead accessing one or more off-die DRAM chips.

As is shown in detail in FIG. 2, the PPU 102 is itself composed of an array of K processing cores 203. Each one of these cores differs from the CPU cores 101 in the sense that PPU cores are designed to execute a large number of threads 903 in parallel, which cooperatively perform computations on a particular set of input data. The PPU 102 also features a host interface 204 that exchanges memory request packets originated from its cores 203 with the system memory 105 through the network interconnect 103. The data parallel computations to be executed in the PPU 102 are issued from the CPU cores 101 in the form of processing tasks and directed to a front end unit 201 that may be derived from a conventional design. The inner workings of the front end unit 201 are omitted as not being critical for the present invention. All thread instructions that constitute processing tasks are scheduled to the PPU cores 203 via the work distribution unit 202. This unit is somewhat similar to the issue queue of a conventional CPU, and thus must distribute thread instructions as the computational resources of the PPU cores 203 become available.

FIG. 3 illustrates a block diagram of a PPU core 203. Each PPU core 203 implements a computation engine by combining P functional units 303 which are specifically designed for executing data parallel SIMD instructions. Since these cores 203 are multithreaded (i.e. capable of concurrently executing M threads 903), a PPU core 203 in this embodiment executes M*P threads 903 concurrently on each clock cycle. The data to be processed by each PPU core 203 is usually accessed from the shared memory 305 even though it could be accessed from the system memory 105. Therefore, whenever the shared memory 305 is accessed at the expense of the system memory 105, the executed code will achieve a higher throughput derived from the reduced latency and increased bandwidth of the underlying memory subsystem. Each PPU core 203 fetches thread instructions from its instruction unit 301, the contents of which are managed by the work distribution unit 202.

Similarly, thread instruction operands and computed results are loaded and stored in a register file 304. Even though the thread arrangement is heavily dependent on the programming model used (e.g. OpenCL or CUDA), there are common parameters such as the thread identification numbers and the amount of thread groups used. These values are used by the execution engine logic of each PPU core 203 and stored in the parameter memory 302 for steering fetched thread instructions into the functional units 303. This is required for ensuring the correctness of the committed instructions. In addition, each PPU core 203 also features a constant memory 306 which is managed by the programming model. The constant memory 306 is designed for broadcasting stored values among threads 903 that simultaneously execute memory read instructions pointing to the same addresses. For this reason this kind of memory is read only, and its contents are written by the programming model before executing any instructions. Additionally, in one embodiment, the contents of the constant memory 306 of each PPU core 203 may be the same.

Programming Model

Embodiments of the present invention hierarchically decompose the parallel object detection method by using one or more of the programming model specifications that are available for GPUs, FPGAs or multicore CPUs. Frameworks such as OpenCL or CUDA offer a standardized abstraction API for exploiting parallelism from the underlying microarchitecture. These APIs also implement runtime libraries which transparently manage the scheduling and execution of kernel functions among the different cores and functional units available in a computer system. Under these programming environments the parallel code must be structured into one or more kernel functions. The code of the main application is then divided into some serial work handled by the CPU cores 101 and multithreaded data-parallel kernel functions which are executed in the parallel processing unit (PPU) 102.

The PPU 102 of the computer system described in FIG. 2 simultaneously fetches and executes instructions from multiple kernel functions which may process data from different streams. This enables the concurrent execution of different kernel functions, thus maximizing the occupancy of the PPU cores 203 and the application throughput. The comparison of serial and concurrent kernel execution is depicted in FIG. 11. When kernels are serially executed 1101, PPU resources are underutilized. Therefore, idle PPU cores 203 could be executing instructions from other kernels instead of ineffectively consuming computing cycles. In order to address this issue, the PPU 102 can issue and execute instructions from different kernels concurrently 1102. As FIG. 11 shows, concurrent execution 1102 may reduce the execution time in situations where computations are unbalanced. These kernel functions are coded using language constructs that explicitly express data-level parallelism by using a set of threads 903 or work-item groups. Additionally, each thread 903 shares the same memory address space of the main program, thus benefiting from a global view of memory. Inter-thread memory consistency and data locality are manually managed by the programmer by loading and storing data in the shared memory of each PPU core 203.
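By way of illustration only, the following minimal sketch shows how such concurrent kernel execution might be expressed under the CUDA programming model using streams; the kernel, buffer names and sizes are hypothetical and not part of the disclosed method:

```cuda
#include <cuda_runtime.h>

// Placeholder per-scale kernel; the actual detection kernels are
// described in the detailed steps below.
__global__ void perScaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 0.5f;   // dummy work
}

int main() {
    const int kScales = 4, kN = 1 << 20;
    cudaStream_t streams[kScales];
    float *buf[kScales];

    for (int s = 0; s < kScales; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], kN * sizeof(float));
        // Each launch goes to its own stream, so the device is free to
        // overlap kernels from different scales (concurrent execution
        // 1102) instead of running them back to back (serial 1101).
        perScaleKernel<<<kN / 256, 256, 0, streams[s]>>>(buf[s], kN);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < kScales; ++s) {
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```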

As shown in FIG. 9, kernel functions are hierarchically decomposed 901 at different thread granularities in which threads 903 are grouped into thread blocks 902. In such a scheme, each thread block 902 is executed in a different PPU core 203 and threads 903 are allowed to perform accesses to the shared memory 305, the constant memory 306, and the system memory 105. However, thread blocks 902 are only allowed to access the shared memory 305 of the PPU core 203 where they are being executed. Additionally, performance can be further increased by carefully partitioning kernel input data into sublists of thread blocks 902 in such a manner that memory accesses are coalesced.
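For illustration, a minimal CUDA-style sketch (all names are illustrative) of this hierarchical decomposition 901, in which a grid of thread blocks 902 maps equally sized chunks onto PPU cores 203:

```cuda
#include <cuda_runtime.h>

__global__ void chunkKernel(const float *img, float *out, int width, int height) {
    // Each thread block 902 processes one 16x16 chunk on one core; each
    // thread 903 handles one element. Consecutive threadIdx.x values
    // touch consecutive addresses, so accesses are coalesced.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = img[y * width + x];
}

void launch(const float *img, float *out, int width, int height) {
    dim3 block(16, 16);                               // one chunk = 16x16 threads
    dim3 grid((width + 15) / 16, (height + 15) / 16); // one block per chunk
    chunkKernel<<<grid, block>>>(img, out, width, height);
}
```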

Parallel Object Detection

The process of object detection consists of determining which regions of a given image are prone to contain a desired object shape (e.g. human faces, handwritten characters or traffic signs, among others). Many of the currently available methods (e.g. the Viola-Jones framework) perform this process using a boosted cascade of classifiers that are previously trained for recognizing specific patterns. These patterns characterize the relevant features of the object and may be based on any of the existing methods such as SIFT, SURF, Local Binary Patterns (LBP) or simple Haar filters. The arrangement of the classifiers within the cascade is generally determined using the so-called boosting process and may involve the usage of a combination of multiple machine learning techniques. Typically, the boosting process builds a cascade of classifiers from an annotated image database with positive and negative object examples. This technique is also known as supervised learning and, in order to be effective, requires an aligned database with images having the same width and height. For this reason, the output of this specific machine learning process is a boosted cascade of classifiers that only detects objects whose dimensions are greater than or equal to the dimensions of the images in the database. Due to the hierarchical nature of the classifier cascade, it is not required to perform filter evaluations for all the features considered in the detection process, thus effectively decreasing the computational footprint. Since filter evaluation usually involves determining the area between pixel regions and these areas may overlap, it is common to use a summed area table 401 for avoiding redundant computations.

As depicted in FIG. 4, a typical implementation of the summed area table 401 method involves the computation of the accumulated sum of all pixel intensity values up to a given pixel according to this expression:

$S\left( {x,y} \right) = {\sum\limits_{i = 0}^{x}{\sum\limits_{j = 0}^{y}{I\left( {i,j} \right)}}}$
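The usefulness of the summed area table 401 stems from the well-known integral image property that the accumulated intensity over any rectangular region can subsequently be recovered with only four table lookups, one per corner; for a rectangle with opposite corners (x₀, y₀) and (x₁, y₁):

$\text{Sum}_{rect} = S\left( {x_1,y_1} \right) - S\left( {x_0,y_1} \right) - S\left( {x_1,y_0} \right) + S\left( {x_0,y_0} \right)$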

The abovementioned summed area table 401 expression could be executed in one of the CPU cores 101 using a conventional sequential implementation that uses two nested loops. However, this solution does not exploit data-level parallelism, thus underutilizing the abundant computational resources of the PPU 102. Therefore, one of the alternatives to increase the object detection throughput would rely on an optimized summed area table 401 method that targets the PPU 102 and fully exploits both coarse-grain and fine-grain parallelism.
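For reference, a minimal sketch of such a two-nested-loop sequential implementation (C-style host code; array names and types are illustrative):

```cuda
// Sequential summed area table on a single CPU core 101. S and I are
// row-major width x height arrays.
void sat_sequential(const unsigned char *I, unsigned int *S,
                    int width, int height) {
    for (int y = 0; y < height; ++y) {
        unsigned int rowSum = 0;
        for (int x = 0; x < width; ++x) {
            rowSum += I[y * width + x];
            // S(x, y) = prefix of the current row + table value just above.
            S[y * width + x] = rowSum + (y > 0 ? S[(y - 1) * width + x] : 0);
        }
    }
}
```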

FIG. 5 shows the flowchart of the parallel object detection process. In this embodiment, the detection method may start from an input image or video frame represented as an array of pixels. These image frames are usually retrieved from a single or multiple video streams, which may originate from digital video cameras or simply be organized as video files. In this scenario, the CPU cores 101 parse the input in parallel by assigning one CPU thread to each video stream. The parsing process handled by the CPU threads also involves handling the video decoding, which itself could be performed using a multithreaded codec engine that exploits intra-frame parallelism. The parallel implementation of this engine is not discussed in detail since it is not critical for the present invention. Additionally, the multithreaded video decoder implementation may be deployed in the form of a shared library or in any other layer of the OS. Once the video frame decoding process has been completed in the CPU cores 101, the parallel object detection method starts by fetching a decoded image frame from the CPU thread pool. As shown in FIG. 5, one of the steps involves the parallel computation of the summed area table 401 using the PPU 102. This computation is parallelized at the fine-grain level by using prefix sum operations. The prefix sum is a data-parallel primitive applied to a given vector in which each element is generated by computing the sum of the elements up to its index. There are possible variations of this primitive, among them the exclusive prefix sum. Given an input sequence A={a₀, a₁, . . ., a_(n−1)}, the exclusive prefix sum primitive A_(exc) is formally defined as follows:

$A_{exc} = \left( {0,\; a_0,\; a_0 + a_1,\; a_0 + a_1 + a_2,\; \ldots,\; a_0 + a_1 + a_2 + \ldots + a_{n-2}} \right)$

For example, for the input sequence A={3, 1, 7, 0}, the exclusive prefix sum is A_(exc)={0, 3, 4, 11}.

Starting from a given image it is then straightforward to build its corresponding summed area table 401 with the help of the exclusive parallel prefix sum primitive. Let I be a given input n×m image encoded as a matrix:

$I = \begin{pmatrix} i_{0,0} & i_{0,1} & \ldots & i_{0,m-1} \\ i_{1,0} & i_{1,1} & \ldots & i_{1,m-1} \\ \vdots & \vdots & \ddots & \vdots \\ i_{n-1,0} & i_{n-1,1} & \ldots & i_{n-1,m-1} \end{pmatrix}$

The I_(S) summed area table 401 is then computed by applying the A_(exc) exclusive prefix sum first to each row and then to the columns in parallel. Since this method is meant to be executed in the PPU 102, in order to preserve data locality in local private caches, it would be much better to perform row-wise exclusive prefix sums and two matrix transpositions instead of both row-wise and column-wise operations. Given this assumption, I_(exc) represents the row-wise exclusive scan and I^(T) the transpose of the image matrix that are used for computing in parallel the I_(S) summed area table 401 as follows:

$I_{exc} = \begin{pmatrix} 0 & i_{0,0} & {i_{0,0} + i_{0,1}} & \ldots & \left( {i_{0,0} + i_{0,1} + \ldots + i_{0,m-2}} \right) \\ 0 & i_{1,0} & {i_{1,0} + i_{1,1}} & \ldots & \left( {i_{1,0} + i_{1,1} + \ldots + i_{1,m-2}} \right) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & i_{n-1,0} & {i_{n-1,0} + i_{n-1,1}} & \ldots & \left( {i_{n-1,0} + i_{n-1,1} + \ldots + i_{n-1,m-2}} \right) \end{pmatrix}$

$I_{exc}^{T} = \begin{pmatrix} 0 & 0 & \ldots & 0 \\ i_{0,0} & i_{1,0} & \ldots & i_{n-1,0} \\ {i_{0,0} + i_{0,1}} & {i_{1,0} + i_{1,1}} & \ldots & {i_{n-1,0} + i_{n-1,1}} \\ \vdots & \vdots & \ddots & \vdots \\ \left( {i_{0,0} + \ldots + i_{0,m-2}} \right) & \left( {i_{1,0} + \ldots + i_{1,m-2}} \right) & \ldots & \left( {i_{n-1,0} + \ldots + i_{n-1,m-2}} \right) \end{pmatrix}$

Taking I^(′) = I_(exc)^(T) and applying the row-wise exclusive prefix sum again:

$I_{exc}^{\prime} = \begin{pmatrix} 0 & 0 & 0 & \ldots & 0 \\ 0 & i_{0,0} & {i_{0,0} + i_{1,0}} & \ldots & \left( {i_{0,0} + \ldots + i_{n-2,0}} \right) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & \left( {i_{0,0} + \ldots + i_{0,m-2}} \right) & \left( {i_{0,0} + \ldots + i_{0,m-2}} \right) + \left( {i_{1,0} + \ldots + i_{1,m-2}} \right) & \ldots & \left( {i_{0,0} + \ldots + i_{0,m-2}} \right) + \ldots + \left( {i_{n-2,0} + \ldots + i_{n-2,m-2}} \right) \end{pmatrix}$

A final transposition I_(S) = I_(exc)^(′T) yields the summed area table:

$I_{S} = \begin{pmatrix} 0 & 0 & \ldots & 0 \\ 0 & i_{0,0} & \ldots & \left( {i_{0,0} + \ldots + i_{0,m-2}} \right) \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \left( {i_{0,0} + \ldots + i_{n-2,0}} \right) & \ldots & \left( {i_{0,0} + \ldots + i_{0,m-2}} \right) + \ldots + \left( {i_{n-2,0} + \ldots + i_{n-2,m-2}} \right) \end{pmatrix}$

The computations described above expose fine-grain parallelism at row-level. Due to this fact, it is then possible to simultaneously launch concurrent parallel prefix sum operations, thus maximizing the occupancy of the PPU 102. The difference between serial and concurrent kernel execution is illustrated in FIG. 11.

As depicted in the flow diagram of FIG. 5, the parallel detection process starts at step 501 by obtaining the input image from the system memory 105. Since the evaluation of the boosted cascade of classifiers must find objects of different sizes, the input image also has to be downscaled multiple times, once for each scale considered. Step 502 involves executing in the PPU 102 as many kernel functions in parallel as the number of selected scales. Each kernel function of step 502 is then responsible for linearly resizing the input image using a different scaling factor. Additionally, in order to avoid the undesired effect of aliasing, which is intrinsic to any scaling process, the usage of smoothing filtering is also required. In step 503 each one of the parallel image downscaling kernels applies any existing filtering method (e.g. bilinear or bicubic filtering, among others). Steps 502 and 503 may be combined and implemented in the PPU 102 with a kernel function that relies on a divide and conquer approach where each input image is split into equally-sized image chunks of w×h elements. Additionally, for each image chunk a block of w×h threads 903 is created. These image chunks, which respectively correspond to different image subregions, are transferred in parallel from the system memory 105 to the shared memory 305 of the PPU cores 203. If the number of thread blocks 902 is greater than the number of cores available in the PPU 102, the remaining thread blocks 902 are dynamically enqueued and dequeued as the PPU 102 resources of each core become available. At this point, memory consistency among all threads 903 within a block is ensured by executing a barrier instruction in each PPU core 203. Then the scaling computations of each image subregion take place by performing the selected smoothing filtering method with fine-grain parallelism in each thread block 902. In this process, each thread 903 within the block computes the new smoothed value of its corresponding image subregion pixel. Therefore, the number of threads 903 in a given thread block 902 equals the number of pixels of the image chunk stored in its corresponding shared memory 305. Finally, the process concludes by transferring the filtered image subregions from the shared memory 305 of each PPU core 203 to the system memory 105. In another embodiment, the method described in steps 502 and 503 may be fully implemented in hardware using fixed-function logic blocks instead of using kernel functions implemented in software. For instance, state of the art GPUs usually perform image scaling and filtering using hardware texture units.
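A minimal sketch of how steps 502 and 503 might be fused into one kernel (CUDA-style; the nearest-source sampling with 3×3 box smoothing shown here is an illustrative choice standing in for the bilinear or bicubic filters mentioned above, and it omits the shared-memory staging of image chunks):

```cuda
// scale > 1 downscales: scale = srcW / dstW. One thread per output pixel.
__global__ void downscaleSmooth(const unsigned char *src, int srcW, int srcH,
                                unsigned char *dst, int dstW, int dstH,
                                float scale) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dstW || y >= dstH) return;

    // Step 502: sample the source at the scaled coordinate.
    int sx = min((int)(x * scale), srcW - 1);
    int sy = min((int)(y * scale), srcH - 1);

    // Step 503: 3x3 box smoothing around the sample to reduce aliasing.
    int acc = 0, cnt = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int u = sx + dx, v = sy + dy;
            if (u >= 0 && u < srcW && v >= 0 && v < srcH) {
                acc += src[v * srcW + u];
                ++cnt;
            }
        }
    dst[y * dstW + x] = (unsigned char)(acc / cnt);
}
```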

Step 504 involves the computation of the summed area table 401 using different degrees of parallelism for the downscaled images. This process computes the I_(S) matrix for each downscaled image in parallel by launching the execution of concurrent kernel functions. Therefore, a given kernel function computes the summed area table 401 for each one of the multiple downscaled images generated in steps 502 and 503, thus exploiting coarse-grain parallelism. Additionally, fine-grain parallelism is achieved through the usage of the A_(exc) exclusive prefix sum as has been previously described. It should be noted that the A_(exc) function is itself parallelized, and all the steps involved in the computation of the summed area table 401 will be further discussed in detail later. The parallel object detection method continues with the evaluation of the boosted cascade of classifiers taking as an input the summed area tables 401. At step 505 the cascade of classifiers is evaluated through the usage of another kernel function which is also executed in the PPU 102. Multiple instances of this kernel function are launched and executed in parallel for all the summed area tables 401 that have been generated in the previous steps. From a high level perspective, there are two possible parallelization strategies for this evaluation when detecting objects that are larger than those that were used for training the cascade. The first one consists of resizing the cascade filters (i.e. a variable-sized sliding window 806) whereas the second relies on rescaling the input image instead of the filters (i.e. a fixed-sized sliding window 806). In this embodiment, the size of the sliding window 806 is fixed and, as the flow diagram of FIG. 5 shows, the method relies on image scaling. Maximizing the occupancy of the functional units 303 of the PPU cores 203 is the reason for implementing this cascade evaluation technique. As the following equation shows, the number of potential simultaneous threads T_(n) decreases as the size of the sliding window 806 (W_(w)×W_(h)) is increased:

$T_{n} = \left\lceil \frac{I_{S_{w}} \cdot I_{S_{h}}}{W_{w} \cdot W_{h}} \right\rceil$

Let I_(S) be a summed area table 401 of dimensions I_(S_w)×I_(S_h); then the number of threads T_(n), and thus the potential parallelism, will be maximized if the sliding window 806 dimensions (W_(w)×W_(h)) are kept constant and small.
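As a purely numeric illustration (the figures are exemplary and not taken from the disclosure): for a 1920×1080 summed area table 401 and a fixed 24×24 sliding window 806,

$T_{n} = \left\lceil \frac{1920 \cdot 1080}{24 \cdot 24} \right\rceil = 3600$

whereas doubling the window to 48×48 would cut the potential thread count to 900, illustrating why a small fixed window maximizes occupancy.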

Finally, in step 506 the parallel object detection process concludes by encoding the evaluation results in a matrix and storing it in the system memory 105. In such a matrix, each element represents the position of the detected object in the input image. For instance, in one embodiment, a given (x, y) matrix element would encode whether a sliding window 806 having its top left corner at those coordinates contains an object or not. Referring again to step 504, each summed area table 401 computed in the PPU 102 relies on an optimized parallel exclusive prefix sum primitive for maximizing performance. This primitive hierarchically decomposes the input stream into several sublists using a divide and conquer approach at different granularities. The purpose of this parallelization pattern is to closely accommodate input data stream partitions with the underlying programming model used for PPU 102 computations (e.g. OpenCL, CUDA among others).

As shown in the flow diagram of FIG. 6, the parallel exclusive prefix sum primitive computation involves several tasks. The process starts at step 601 by accessing the input data stream from the system memory 105 and continues 602 by partitioning the stream into equally-sized substreams. In an advantageous embodiment, the size of each substream is a multiple of the P-way SIMD engines available in the functional units 303 of the PPU cores 203. Therefore, an optimal-sized substream requires M*P elements where M≧1 and constitutes a thread block 902. This arrangement allows obtaining a work-efficient parallel implementation that only needs to perform the minimum number of additions, thus increasing throughput. Similarly, PPU 102 occupancy is maximized due to the fact that all SIMD lanes of the functional units 303 are kept busy. Hence, all additions are hierarchically structured using the multithreaded programming model depicted in FIG. 9. In order to maximize memory bandwidth, each substream obtained from the M*P partition is transferred at step 603 from the system memory to the shared memories 305 available in each PPU core 203. These memory transfers are performed in parallel by grouping the substream elements into data chunks and subsequently storing each one of them into different shared memories 305. Moreover, thread synchronization through barrier instructions is required for ensuring that all transfers to the shared memories 305 have been completed before starting SIMD additions.

At the lowest level, this thread organization computes the exclusive parallel prefix sum for a given substream of P elements by using log₂ P SIMD instructions, which themselves perform P additions in parallel. These computations are performed independently for each one of the substreams at step 604 by using different thread blocks 902. Thus, when this step concludes, each thread block 902 will have computed an intermediate prefix sum of the substream elements stored in the shared memory 305 of the core where the abovementioned thread block 902 was executed. In order to obtain the complete exclusive parallel prefix sum of the input stream, it is required to perform further additions with the intermediate results obtained at step 604. These additions are conducted at step 605 where each thread block 902 computes again the remaining sums. Once these parallel additions have concluded, the exclusive parallel prefix sum of the input stream will be fully completed. The final step 606 of the flow diagram depicted in FIG. 6 performs the memory transfers of these stream elements from the shared memory 305 of each PPU core 203 and subsequently stores them into the system memory 105.
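A minimal sketch of the per-block portion of steps 603 through 606 (CUDA-style, Blelloch-type work-efficient scan; it assumes P is a power of two with one thread per element, and omits both the inter-block additions of step 605 and bank-conflict avoidance):

```cuda
// Exclusive prefix sum of one substream of P elements in shared memory
// 305, one thread block 902 per substream. Launch as:
//   blockExclusiveScan<<<numSubstreams, P, P * sizeof(int)>>>(in, out, P);
__global__ void blockExclusiveScan(const int *in, int *out, int P) {
    extern __shared__ int temp[];
    int tid = threadIdx.x;
    int base = blockIdx.x * P;

    temp[tid] = in[base + tid];            // step 603: stage in shared memory
    __syncthreads();                       // barrier before SIMD additions

    // Up-sweep: log2(P) rounds of parallel additions (step 604).
    for (int stride = 1; stride < P; stride <<= 1) {
        int idx = (tid + 1) * stride * 2 - 1;
        if (idx < P) temp[idx] += temp[idx - stride];
        __syncthreads();
    }
    if (tid == 0) temp[P - 1] = 0;         // identity seeds the down-sweep
    __syncthreads();

    // Down-sweep: distribute the partial sums back down the tree.
    for (int stride = P >> 1; stride > 0; stride >>= 1) {
        int idx = (tid + 1) * stride * 2 - 1;
        if (idx < P) {
            int t = temp[idx - stride];
            temp[idx - stride] = temp[idx];
            temp[idx] += t;
        }
        __syncthreads();
    }
    out[base + tid] = temp[tid];           // step 606: write back
}
```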

Referring again to step 504, the parallel computation of the summed area table in the PPU 102 is obtained as a result of a combination of a series of exclusive parallel prefix sum operations and matrix transpositions. These steps, depicted in FIG. 7, exploit coarse-grain parallelism by concurrently executing the same kernel functions for each downscaled image generated at step 503, in accordance with one embodiment of the present invention. At step 701 the summed area table computation starts by accessing the I input matrix values from the system memory 105. The process continues 702 by decomposing the I matrix into rows with the purpose of exploiting fine-grain parallelism at row-level. At step 703 an exclusive parallel prefix sum kernel is computed for each one of the abovementioned rows, thus producing the I_(exc) matrix. The purpose of this parallelization scheme is to further increase the occupancy of the cores of the PPU 102. Step 704 conducts a parallel transposition of I_(exc) in order to generate the I_(exc)^(T) matrix. In one embodiment, the method used for performing the parallel matrix transposition may be based on any of the existing techniques that were designed for distributed memory architectures. For instance, the parallelization strategy followed for the transposition could involve the decomposition of the matrix into equally sized tiles. Then each tile might be copied in parallel into the shared memory 305 of each PPU core 203. Finally, the rows and columns could be exchanged in place from the shared memory 305 in order to maximize bandwidth. At step 705 the I_(exc)^(T) matrix is decomposed into rows as was previously done at step 702. Then the exclusive parallel prefix sums are computed for each row at step 706 using the same techniques conducted at step 703. As a result of this step, the I′_(exc) matrix is generated. At step 707 a final parallel transposition of the I′_(exc) matrix is performed following the same strategy used at step 704. The desired summed area table I_(S) is thus finally obtained at the end of the computations of step 707.
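A minimal sketch of the tiled parallel transposition of steps 704 and 707 (CUDA-style; the 16×16 tile size is an illustrative choice):

```cuda
#define TILE 16

// Tiled matrix transpose: each thread block 902 stages one tile in
// shared memory 305 so that both the read and the write are coalesced.
__global__ void transposeTiled(const int *in, int *out, int rows, int cols) {
    __shared__ int tile[TILE][TILE + 1];       // +1 pads away bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;   // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;   // row in the input
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();

    // Swap the block coordinates and exchange row/column within the tile.
    int tx = blockIdx.y * TILE + threadIdx.x;  // column in the output
    int ty = blockIdx.x * TILE + threadIdx.y;  // row in the output
    if (tx < rows && ty < cols)
        out[ty * rows + tx] = tile[threadIdx.x][threadIdx.y];
}
```

The full step 701 to 707 pipeline is then the composition scan-rows, transpose, scan-rows, transpose, with each stage launched as a separate kernel over the matrix produced by the previous one.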

Referring again to FIG. 5, once the summed area tables for each downscaled image have been computed 504, the parallel evaluation of the boosted cascade of classifiers is performed at step 505. In order to fully maximize the occupancy of the PPU cores 203, all the computations are implemented in a single kernel, which is concurrently launched and executed for each downscaled image. All the steps conducted in step 505 are illustrated in the flowchart of FIG. 10. At step 1001, a given summed area table 401, obtained as a result of the abovementioned steps, is split into equally-sized table chunks. The size of these table chunks is experimentally determined in such a manner that maximizes the T_(n) thread count. In this embodiment, for each equally-sized table chunk a thread block 902 is created. The number of threads 903 in a thread block 902 corresponds to the total number of elements included in a given table chunk. These threads 903 are initialized and created at step 1002 using the programming model specifications illustrated in FIG. 9. Thereafter, elements from a given table chunk are transferred at step 1003 from the system memory 105 to the shared memory 305 of each PPU core 203 using a 4×4 grid pattern. As shown in FIG. 8, in this embodiment all memory transfers required for storing the 4×4 grid are performed by executing a single thread block 902, where each thread 903 transfers a collection of elements. Starting from a given summed area table 401 chunk (i,j) 801, not only is it required to transfer to the shared memory 305 all elements of chunk (i,j) 801, but also elements from the immediately adjacent chunks: chunk (i,j+1) 803, chunk (i+1,j) 804, and chunk (i+1,j+1) 805. Since each chunk (i,j) 801 is meant to be processed by a different PPU core 203, the sliding window 806 starting at a given element 802 from chunk (i,j) 801 may require elements from adjacent chunks too. This is due to the fact that both the sliding window 806 and the chunk (i,j) 801 must have the same dimensions. Referring again to FIG. 10, the chunk (i,j+1) 803, chunk (i+1,j) 804, and chunk (i+1,j+1) 805 are transferred from the system memory 105 to the shared memory 305 of each PPU core 203 at step 1004. Additionally, in one particular embodiment, a barrier instruction may be required after the completion of steps 1003 and 1004 for ensuring that all threads 903 in a thread block 902 have completed the required memory transfers.

The fine-grain parallel object detection process starts by extracting filters from the boosted cascade of classifiers at step 1005. In this embodiment, the filters are assumed to be stored in the constant memory 306 of each PPU core 203 before the process starts. These filters have been computed from a selection of visual descriptors. The visual descriptors may be extracted from images using feature extraction methods such as SIFT, SURF, LBP or Haar among others. Initially the features are retrieved from the constant memory 306 in parallel by each thread 903 from a thread block 902 processing a given chunk (i,j) 801. Multiple summed area table 401 chunks are subsequently scanned in parallel taking into account the three adjacent chunks; thus each thread block 902 considers its corresponding 4×4 table chunk as the area of interest. At step 1006 each thread 903 from a thread block 902 processing a given chunk (i,j) 801 starts evaluating all filters stored in the constant memory 306 iteratively. As FIG. 10 depicts, this subprocess is implemented in the form of a parallel loop where each thread 903 independently evaluates all filters for a given sliding window 806. Moreover, each filter must not be evaluated more than once by a given thread 903. All operations required for a single filter evaluation are conducted at step 1006.

The amount and type of operations greatly varies depending on the visual descriptors and the technique used for extracting them. Typically, the result of a filter evaluation 1006 is a numeric value that must be compared against a preassigned threshold, which is also stored in the constant memory 306. In one embodiment, the filter evaluation 1006 is considered positive 1007 if the obtained numeric value does not violate the abovementioned threshold. For instance, the threshold violation check may be implemented in a particular embodiment using a conditional instruction (e.g. A>B). In case of a threshold violation, the thread 903 in charge of a given sliding window 806 triggers an early exit of the loop at step 1011. This technique frees PPU core 203 resources, which may be used for issuing and executing instructions from other threads 903. If the threshold is not violated, at step 1008 the thread 903 checks if all filters stored in the constant memory 306 have been evaluated. If there are filters that still remain unevaluated, the loop continues at step 1009 by selecting one of the remaining filters from the constant memory 306. Otherwise, the loop concludes at step 1010 and determines the presence of an object in the sliding window 806 evaluated by the abovementioned thread 903.
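A condensed sketch of the per-window evaluation loop of steps 1005 through 1011 (CUDA-style; the rectangle-plus-threshold filter record and all names are illustrative simplifications of whatever descriptor the cascade actually uses, and the shared-memory staging of adjacent chunks from steps 1003 and 1004 is omitted, with reads going directly to the table):

```cuda
// Illustrative filter record: a rectangle over the summed area table
// plus the threshold its weighted sum must satisfy.
struct Filter { int x0, y0, x1, y1; float weight, threshold; };

#define MAX_FILTERS 128
__constant__ Filter filters[MAX_FILTERS];   // constant memory 306
__constant__ int numFilters;

// One thread 903 per sliding window 806; result is 1 where an object
// was found with the window's top left corner at (x, y).
__global__ void evaluateCascade(const unsigned int *sat, int satW, int satH,
                                unsigned char *result, int winW, int winH) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x + winW >= satW || y + winH >= satH) return;

    for (int f = 0; f < numFilters; ++f) {   // steps 1005/1009: next filter
        const Filter &ft = filters[f];      // broadcast read, constant memory
        // Rectangle sum via four summed-area-table lookups (step 1006).
        unsigned int a = sat[(y + ft.y0) * satW + (x + ft.x0)];
        unsigned int b = sat[(y + ft.y0) * satW + (x + ft.x1)];
        unsigned int c = sat[(y + ft.y1) * satW + (x + ft.x0)];
        unsigned int d = sat[(y + ft.y1) * satW + (x + ft.x1)];
        float v = ft.weight * (float)((long long)d - b - c + a);
        if (v < ft.threshold) {             // threshold violated:
            result[y * satW + x] = 0;       // early exit 1011 frees the thread
            return;
        }
    }
    result[y * satW + x] = 1;               // all filters passed: object 1010
}
```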

It should be noted that all the steps that constitute the flowchart depicted in FIG. 10 are implemented in a single kernel function. Since this kernel function is concurrently executed for a collection of summed area tables 401 corresponding to different image scales, the output of the kernel is the parallel detection of objects at different sizes. Additionally, due to the fact that fine-grain parallelism is exploited within a kernel and coarse-grain parallelism by concurrently executing the kernel for different scales, PPU 102 underutilization is avoided.

What is claimed is:
1. A method for detecting variable sized objects in an image using a computer system with an array of cores configured to concurrently execute a plurality of threads, wherein each core comprises a shared memory, a constant memory and a plurality of functional units, the method comprising the following steps: calculating a summed area table from an image, by executing a plurality of exclusive prefix sum kernel functions and at least two transposition kernel functions, decomposing the summed area table into a plurality of table chunks, creating a thread per element of the table chunk and storing each table chunk in the shared memory of a core, evaluating a plurality of classifiers for a plurality of table chunks by executing a cascade evaluation kernel function to determine whether an object is enclosed within the table chunk, wherein each kernel function is assigned to a plurality of thread blocks, a thread block being executed in a single core and having access to the shared memory of the core, each thread of the thread block comprising Single Instruction Multiple Data (SIMD) instructions to be processed in a plurality of functional units of the core.

2. The method of claim 1, wherein the image for calculating the summed area table is a downscaled image selectable from a plurality of downscaled images obtained by executing a resizing kernel function and by applying a plurality of scale factors to an input image.

3. The method of claim 2, wherein the image is filtered by applying a smoothing filter.

4. The method of claim 1, wherein the plurality of classifiers are arranged in a cascade and sequentially evaluated.

5. The method of claim 4, wherein the plurality of classifiers are stored in the constant memory of each core.

6. The method of claim 2, wherein calculating the summed area table is performed concurrently for a plurality of downscaled images of an image and further comprises for each downscaled image: encoding a given downscaled image in a matrix comprising rows and columns, executing the exclusive prefix sum kernel function by concurrently calculating, for each row element of the matrix, an exclusive prefix sum comprising an accumulated sum of row elements to form an intermediate matrix, executing the transposition kernel function by swapping rows and columns of the intermediate matrix to form a transposed matrix, executing the exclusive prefix sum kernel function by concurrently calculating, for each row element of the transposed matrix, an exclusive prefix sum comprising an accumulated sum of row elements to form a second intermediate matrix; and executing the transposition kernel function by swapping rows and columns of the second intermediate matrix to form a summed area table for the downscaled image.

7. The method of claim 6, wherein the plurality of summed area tables corresponding to the plurality of downscaled images are decomposed into equally sized table chunks, the size being selected in dependence upon the size of the object to be detected.

8. The method of claim 1, wherein three adjacent table chunks of a given summed area table chunk are transferred to the shared memory of the core in which the given table chunk is processed.

9. The method of claim 1, wherein the number of elements of the table chunk is a multiple of the number of functional units of a core.

10. A method for detecting variable sized objects in an image using a computer system with an array of cores configured to concurrently execute a plurality of threads, each core comprising a shared memory, a constant memory and a plurality of functional units, the method comprising the following steps: downscaling the image according to a plurality of scale factors to form a plurality of downscaled images, decomposing each downscaled image into a plurality of equally sized image chunks corresponding to different image subregions, and applying a smoothing filter to each image chunk by creating a thread block, wherein the image chunk and the thread block have the same number of elements and wherein each filtered pixel of the image chunk is computed by a thread in the thread block.

11. The method of claim 10, further comprising: calculating a summed area table from each downscaled image, decomposing each summed area table into a plurality of equally sized table chunks, creating a thread block for the plurality of table chunks, wherein each table chunk and a corresponding thread block have the same number of elements, and wherein each element of the table chunk is computed by a thread in the thread block, and evaluating a plurality of classifiers for each element of the table chunk to determine whether an object is enclosed within the table chunk.

12. The method of claim 11, further comprising transferring the plurality of equally sized image chunks to the shared memory of the cores prior to applying the smoothing filter.

13. The method of claim 11, further comprising transferring each table chunk with the three corresponding adjacent table chunks to the shared memory of each core.

14. The method of claim 11, further comprising retrieving a plurality of classifiers from the constant memory of each core prior to evaluating the plurality of classifiers.

15. The method of claim 11, wherein the step of evaluating classifiers for each element of the table chunk is performed for a given classifier when the result of the previous classifier is positive.

16. The method of claim 11, further comprising storing the results of evaluating the plurality of classifiers for each element of the table chunk in the system memory.

17. A computer program product for detecting objects in an image comprising program code, which during execution in a computer system calls a method for detecting variable sized objects in an image, the method comprising the following steps: downscaling the image according to a plurality of scale factors to form a plurality of downscaled images, decomposing each downscaled image into a plurality of equally sized image chunks corresponding to different image subregions, applying a smoothing filter to each image chunk.

18. The computer program product of claim 17, further comprising: calculating a summed area table from each downscaled image, decomposing each summed area table into a plurality of equally sized table chunks, evaluating a plurality of classifiers for each element of the table chunk to determine whether an object is enclosed within the table chunk.

19. The computer program product of claim 18, further comprising: encoding each positive result of evaluating the plurality of classifiers with a spatial reference of the position of the identified object in the image.